NSF PAR Search | NSF Public Access Repository

Faster Learned Sparse Retrieval with Guided Traversal

https://doi.org/10.1145/3477495.3531774

Mallia, Antonio; Mackenzie, Joel; Suel, Torsten; Tonellotto, Nicola (July 2022, ACM)

Full Text Available
Efficiency Implications of Term Weighting for Passage Retrieval

https://doi.org/10.1145/3397271.3401263

Mackenzie, Joel; Dai, Zhuyun; Gallagher, Luke; Callan, Jamie (July 2020, Proceedings of the 43rd International ACM SIGIR Conference on Research & Development in Information Retrieval)

Language model pre-training has spurred a great deal of attention for tasks involving natural language understanding, and has been successfully applied to many downstream tasks with impressive results. Within information retrieval, many of these solutions are too costly to stand on their own, requiring multi-stage ranking architectures. Recent work has begun to consider how to “backport” salient aspects of these computationally expensive models to previous stages of the retrieval pipeline. One such instance is DeepCT, which uses BERT to re-weight term importance in a given context at the passage level. This process, which is computed offline, results in an augmented inverted index with re-weighted term frequency values. In this work, we conduct an investigation of query processing efficiency over DeepCT indexes. Using a number of candidate generation algorithms, we reveal how term re-weighting can impact query processing latency, and explore how DeepCT can be used as a static index pruning technique to accelerate query processing without harming search effectiveness.
Full Text Available
Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format

https://doi.org/10.1145/3397271.3401404

Lin, Jimmy; Mackenzie, Joel; Kamphuis, Chris; Macdonald, Craig; Mallia, Antonio; Siedlaczek, Michał; Trotman, Andrew; de Vries, Arjen (July 2020, ACM)

There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures and builds wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASSv2, OldDog, PISA, and Terrier. Together, these systems provide a wide range of implementations and features, with different research goals. Overall, we recommend CIFF as a low-effort approach to support independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.
Full Text Available
PISA: Performant Indexes and Search for Academia

Mallia, Antonio; Siedlaczek, Michal; Mackenzie, Joel; Suel, Torsten (January 2019, Proceedings of the Open-Source IR Replicability Challenge)

Performant Indexes and Search for Academia (PISA) is an experimental search engine that focuses on efficient implementations of state-of-the-art representations and algorithms for text retrieval. In this work, we outline our effort in creating a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure. We also discuss the origins, current functionality, and future direction and challenges for the PISA system.
Full Text Available
Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study

https://doi.org/10.1007/978-3-030-15712-8_22

Mackenzie, Joel; Mallia, Antonio; Petri, Matthias; Culpepper, Shane; Suel, Torsten (January 2019, European Conference on Information Retrieval)

Document reordering is an important but often overlooked preprocessing stage in index construction. Reordering document identifiers in graphs and inverted indexes has been shown to reduce storage costs and improve processing efficiency in the resulting indexes. However, surprisingly few document reordering algorithms are publicly available despite their importance. A new reordering algorithm derived from recursive graph bisection was recently proposed by Dhulipala et al., and shown to be highly effective and efficient when compared against other state-of-the-art reordering strategies. In this work, we present a reproducibility study of this new algorithm. We describe the implementation challenges encountered, and explore the performance characteristics of our clean-room reimplementation. We show that we are able to successfully reproduce the core results of the original paper, and show that the algorithm generalizes to other collections and indexing frameworks. Furthermore, we make our implementation publicly available to help promote further research in this space.
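
As a rough illustration of why reordering document identifiers can reduce storage costs, the sketch below delta-encodes a toy posting list and counts the bytes a simple 7-bit variable-byte code would need before and after a reordering. It is not the recursive graph bisection algorithm from the paper: the posting list, the `mapping` of old to new identifiers, and all function names are hypothetical, chosen only to show the effect of smaller gaps.

```python
def varint_len(n: int) -> int:
    """Bytes needed to store a non-negative integer with a 7-bit variable-byte code."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def gaps(postings):
    """Delta-encode a sorted posting list: the first ID, then successive differences."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def encoded_size(postings):
    """Total bytes for the variable-byte encoded gaps of a posting list."""
    return sum(varint_len(g) for g in gaps(sorted(postings)))


def remap(postings, mapping):
    """Apply a document reordering (old ID -> new ID) to a posting list."""
    return sorted(mapping[d] for d in postings)


# Toy posting list: documents containing the term are scattered across the ID space.
scattered = [3, 400, 9000, 70000, 150000]

# Hypothetical reordering that assigns these similar documents adjacent IDs.
mapping = {3: 10, 400: 11, 9000: 12, 70000: 13, 150000: 14}
clustered = remap(scattered, mapping)

print(encoded_size(scattered), encoded_size(clustered))
```

In this toy case the clustered list needs fewer bytes because every gap after the first is 1. A real reordering algorithm has to find an assignment that shrinks gaps across all posting lists at once.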
Full Text Available
